This notebook is about classifying Soil Carbon Challenge data according to observation type, plus some cleanup of inconsistent terms. There are probably much better ways to classify than the keyword matching used here; one table-driven alternative is sketched after the first classification cell.


In [97]:
# The preamble
import pandas as pd
#pd.set_option('mode.sim_interactive', True)
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
#from collections import OrderedDict
import json, csv
import re

In [72]:
df


Out[72]:
name entered values observer_id site_id type_id json
0 297 12/4/2016 16:57 {"observer": "", "label1": "cm of peat per yea... 1 204 37 {"model": "obs.observation", "fields": {"value...
1 298 12/4/2016 16:57 {"observer": "", "label1": "tons C per hectare... 1 205 37 {"model": "obs.observation", "fields": {"value...

load csv

This notebook records some data operations on Soil Carbon Challenge data to prepare it for insertion into the atlasbiowork database. The munging and cleanup process benefits from alternating between Excel and the notebook for inspection and cleanup.


In [54]:
#df = pd.read_csv('/Users/Peter/Documents/scc/challenge/PSQL/coverNEW.csv', parse_dates=['date'], usecols=['id','type','date','filename'])
#df = df.replace(np.nan,' ', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE
#df = pd.read_csv('/Users/Peter/Documents/scc/challenge/datamungedOct2016c.csv')
df = pd.read_csv('/Users/Peter/Documents/atlas/atlasbiowork.com/atlasbiowork/db/changeREADYD.csv')
#df.dtypes

In [48]:
print df.to_json(orient='records')


[{"name":297,"entered":"12\/4\/2016 16:57","values":"{\"observer\": \"\", \"label1\": \"cm of peat per year\", \"value1\": \"-2\", \"label2\": \"\", \"value2\": \"\", \"label3\": \"\", \"value3\": \"\", \"start_date\": \"1\/1\/1923\", \"end_date\": \"1\/1\/2007\", \"description\": \"This post was put in to bedrock in 1923, in deep peat or muck soils. The top was level with the soil surface. Due to drainage and subsequent oxidation of these peat soils, the surface has subsided more than 1.5 m since 1923. In 1927 this area was drained for sugarcane production, and the loss of soil C has been estimated at 20 T C per hectare per year, assuming a bulk density of .22 and a carbon content of .11 g per cubic centimeter. (http:\/\/www.hos.ufl.edu\/vegetarian\/09\/Nov\/Soil%20Subsidence%20in%20the%20Everglades%20Agricultural%20Area.pdf) Soil subsidence in the Everglades area (http:\/\/www.grist.org\/i\/assets\/notill_and_C_sequestration.pdf) Agriculture, Ecosystems and Environment 118 (2007) 1-5\", \"photo1\": \"observations\/change_bellegladepost.jpg\", \"photo2\": \"\"}","observer_id":1,"site_id":204,"type_id":37},{"name":298,"entered":"12\/4\/2016 16:57","values":"{\"observer\": \"\", \"label1\": \"tons C per hectare\", \"value1\": \"-23\", \"label2\": \"\", \"value2\": \"\", \"label3\": \"\", \"value3\": \"\", \"start_date\": \"\", \"end_date\": \"\", \"description\": \"<p>The 2002 Biscuit Fire in southwest Oregon burned 200,000 hectares, including some soil research plots. After the fire, Forest Service scientists analyzed the soil layers in the plots, and documented an astounding loss of soil carbon and nitrogen, not only from the surface layer of organic matter, but in the mineral soil layers, where a majority of the losses occurred. An inch of soil, about 127 tons per hectare, disappeared, perhaps into the smoke plume, leaving a crust of rocks behind.<\/p><p>The loss of topsoil, carbon, and nitrogen from soil can negatively affect a range of processes, including nutrient retention and water infiltration. In the absence of nitrogen-fixing plants, such losses of nitrogen would require at least a century to be reversed, according to the researchers.<\/p><p>Wildfire <a href=\"http:\/\/asd-www.larc.nasa.gov\/biomass_burn\/globe_impact.html\">may contribute to climate change<\/a> in the short term, by releasing carbon and nitrous oxide as greenhouse gases and, in the long term, by reducing soil productivity through losses of organic matter and nutrients. With less productive soils, a forest will not grow as quickly nor reabsorb as much carbon as before a burn.<\/p><a href=\"http:\/\/rparticle.web-p.cisti.nrc.ca\/rparticle\/AbstractTemplateServlet?calyLang=eng&journal=cjfr&volume=38&year=0&issue=11&msno=x08-136\">Can. J. For. Res. 38(11): 2771-83 (2008)<\/a>\", \"photo1\": \"observations\/change_wildfire_biscuit.jpg\", \"photo2\": \"\"}","observer_id":1,"site_id":205,"type_id":37}]

In [41]:
df.to_csv('/Users/Peter/Documents/atlas/atlasbiowork.com/atlasbiowork/db/changeREADYD.csv', index=False)

In [23]:
df.values[5]


Out[23]:
array(['Bolinas Orchard', '12/4/2016 16:57',
       '{"observer": "Jeff Creque", "label1": "percent increase in organic matter per year, in top 12 inches", "value1": "6", "label2": "", "value2": "", "label3": "", "value3": "", "start_date": "1/1/1979", "end_date": "1/1/1999", "description": "From 2.5% organic matter in top 12 inches in 1979 to 8% in 1999", "photo1": "", "photo2": ""}',
       1L, 37L, 208L], dtype=object)

In [10]:
print df.values[1]


['Belle Glade, Florida' '12/4/2016 16:57'
 '{"observer": "", "label1": "cm of peat per year", "value1": "-2", "label2": "", "value2": "", "label3": "", "value3": "", "start_date": "1/1/1923", "end_date": "1/1/2007", "description": "This post was put in to bedrock in 1923, in deep peat or muck soils. The top was level with the soil surface. Due to drainage and subsequent oxidation of these peat soils, the surface has subsided more than 1.5 m since 1923. In 1927 this area was drained for sugarcane production, and the loss of soil C has been estimated at 20 T C per hectare per year, assuming a bulk density of .22 and a carbon content of .11 g per cubic centimeter. (http://www.hos.ufl.edu/vegetarian/09/Nov/Soil%20Subsidence%20in%20the%20Everglades%20Agricultural%20Area.pdfSoil subsidence in the Everglades area</a><br>(http://www.grist.org/i/assets/notill_and_C_sequestration.pdfAgriculture, Ecosystems and Environment 118 (2007) 1-5</a>", "photo1": "observations/change_bellegladepost.jpg", "photo2": ""}'
 1L 37L 204L]

Save your work periodically, but with a new filename each time so earlier versions are preserved.


In [24]:
#df.to_csv('/Users/Peter/Documents/scc/challenge/datamungedOct2016d.csv', index=False)
df.to_csv('/Users/Peter/Documents/scc/challenge/obs_types/changeREADYB.csv', index=False)
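
A timestamped filename is one way to make the "new filename each time" habit automatic (a sketch, not from the original notebook; the directory is the one used above):

import time
stamp = time.strftime('%Y%m%d-%H%M')
df.to_csv('/Users/Peter/Documents/scc/challenge/obs_types/changeREADY-%s.csv' % stamp, index=False)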

In [12]:
#ADD SOME COLUMNS and DROP OTHERS
df = df[df['id']!='geometry']
df['obs_type'] = ""
df['newurl'] = ""
df = df.drop(['lon', 'marker'], axis=1)
df.shape


Out[12]:
(6312, 20)

Using APPLY to classify according to keywords or strings


In [79]:
df = df.replace(np.nan,' ', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE
#analysis_terms = ['analysis', 'density', 'average','variance','pH']
def mymunge(row): # a VARIETY OF CRITERIA
    if any( ['along' in row['note'], 'transect' in row['type'], 'grade' in row['type'], 'line' in row['type'], 'map' in row['type'], 'plot' in row['type'], 'remonitor' in row['type'], 'plants' in row['type']] ):
        return 'transect'
    if any( ['vert' in row['type'], 'ang' in row['type'], 'step back' in row['note'], 'hoop' in row['note'], 'Hoop' in row['note'], 'bare' in row['label1']] ):
        return 'cover'
    if any( ['infiltracion' in row['type'], 'infiltration' in row['type']] ):
        return 'infiltration'
    if any( ['analysis' in row['type'], 'composited' in row['type'], 'misc' in row['type'], 'density' in row['type'],'variance' in row['type'], 'average' in row['type'], 'pH' in row['type']] ):
        return 'analysis'
    if 'photo' in row['type']:
        return 'photo'
    if 'change' in row['type']:
        return 'change'
    if any( ['brix' in row['type'], 'biomass' in row['type'], 'clip' in row['type']] ):
        return 'food_analysis'
    
    
df['obs_type'] = df.apply(mymunge, axis=1)
#SEE HOW WE DID
print(len(df))
print(df.obs_type.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=False))


6312
analysis         2256
transect         1513
cover            1341
infiltration     1065
photo             112
food_analysis      13
NaN                12
Name: obs_type, dtype: int64
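
As the introduction admits, there are probably better ways to classify. One sketch (untested against the real data): drive the same logic from a list of (column, keyword, category) rules checked in priority order, so adding a keyword means adding a tuple rather than editing the function body.

rules = [
    ('note', 'along', 'transect'),
    ('type', 'transect', 'transect'),
    ('type', 'vert', 'cover'),
    ('note', 'step back', 'cover'),
    ('type', 'infiltration', 'infiltration'),
    ('type', 'analysis', 'analysis'),
    ('type', 'photo', 'photo'),
    ('type', 'change', 'change'),
    ('type', 'brix', 'food_analysis'),
    #...remaining keywords from mymunge, in the same priority order
]

def classify(row):
    for col, keyword, category in rules:
        if keyword in row[col]:
            return category
    return None  # unmatched rows show up as NaN in value_counts

#df['obs_type'] = df.apply(classify, axis=1)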


In [13]:
df = pd.read_csv('/Users/Peter/Documents/scc/challenge/obs_types/change.csv')

In [10]:
df = df.replace(np.nan,'', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE
def buildValues(row):
    v = (row['url'], row['photo1'])
    return v
df['list'] = df.apply(buildValues, axis=1)
print list(df['list'])


[('change/bellegladepost.jpg', 'change_bellegladepost.jpg'), ('change/breton.jpg', 'change_breton.jpg'), ('change/gabebrown.jpg', 'change_gabebrown.jpg'), ('change/highplains.jpg', 'change_highplains.jpg'), ('change/holmepost.jpg', 'change_holmepost.jpg'), ('change/morrow.jpg', 'change_morrow.jpg'), ('change/questatransect09.jpg', 'change_questatransect09.jpg'), ('change/redbarnphoto2008.JPG', 'change_redbarnphoto2008.JPG'), ('change/RodaleFST.jpg', 'change_RodaleFST.jpg'), ('change/rosebudmine.jpg', 'change_rosebudmine.jpg'), ('change/rothamsted.jpg', 'change_rothamsted.jpg'), ('change/russellranch.jpg', 'change_russellranch.jpg'), ('change/salemroadpasture.jpg', 'change_salemroadpasture.jpg'), ('change/switchgrass.jpg', 'change_switchgrass.jpg'), ('change/twitchell.jpg', 'change_twitchell.jpg'), ('change/westwind.jpg', 'change_westwind.jpg'), ('change/wildfire_biscuit.jpg', 'change_wildfire_biscuit.jpg'), ('corc/corc1.png', 'corc_corc1.png'), ('fllp/fernroadMay2011 003.jpg', 'fllp_fernroadMay2011_003.jpg'), ('hjer/hjer1.png', 'hjer_hjer1.png'), ('jaff/ja 012.jpg', 'jaff_ja_012.jpg'), ('king/jan 2011 361.jpg', 'king_jan_2011_361.jpg'), ('king/jan 2011 390.jpg', 'king_jan_2011_390.jpg'), ('malm/MBEC3.png', 'malm_MBEC3.png'), ('mcne/mcne1.png', 'mcne_mcne1.png'), ('milt/Aug5-2011 008.jpg', 'milt_Aug5-2011_008.jpg'), ('morr/morr4.png', 'morr_morr4.png'), ('neum/neum1.png', 'neum_neum1.png'), ('prat/Jul2011 002.jpg', 'prat_Jul2011_002.jpg'), ('sunb/sunb1.png', 'sunb_sunb1.png'), ('sunb/sunb2.png', 'sunb_sunb2.png'), ('wall/wall1.png', 'wall_wall1.png'), ('work/jan 2011 371.jpg', 'work_jan_2011_371.jpg'), ('work/jan 2011 389.jpg', 'work_jan_2011_389.jpg'), ('work/jan 2011 390.jpg', 'work_jan_2011_390.jpg'), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', '')]
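
The pairs above suggest how photo1 was derived from url: the first '/' and any spaces become underscores. A hypothetical one-liner reproducing that mapping (inferred from the output, not recovered from recorded code):

def to_photo_name(url):
    return url.replace('/', '_', 1).replace(' ', '_')

#to_photo_name('king/jan 2011 361.jpg') -> 'king_jan_2011_361.jpg'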

In [14]:
df = df.replace(np.nan,'', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE

def buildValues(row): # build the values json field
    v = '{"observer": "' + row['observer'] \
    + '", "label1": "' + row['label1'] \
    + '", "value1": "' + row['value1'] \
    + '", "label2": "' + row['label2'] \
    + '", "value2": "' + row['value2'] \
    + '", "label3": "' + row['label3'] \
    + '", "value3": "' + row['value3'] \
    + '", "start_date": "' + row['start_date'] \
    + '", "end_date": "' + row['end_date'] \
    + '", "description": "' + row['description'] \
    + '", "photo1": "' + row['photo1'] \
    + '", "photo2": "' + row['photo2'] \
    + '"}'
    return v
    
    
df['values'] = df.apply(buildValues, axis=1)
#SEE HOW WE DID
#print(len(df))
#print(df.obs_type.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=False))
print df['values'][50]


{"observer": "", "label1": "tons C per ha per yr", "value1": "1.17", "label2": "", "value2": "", "label3": "", "value3": "", "start_date": "", "end_date": "", "description": "<p>As reported in <a href="http://www.researchgate.net/publication/272376062_Tillage_and_Cover_Cropping_Affect_Crop_Yields_and_Soil_Carbon_in_the_San_Joaquin_Valley_California">Agronomy Journal</a> (01/2015; DOI: 10.2134/agronj14.0415), reduced tillage with cover cropping trials with cotton and tomatoes in the San Joaquin valley resulted in increased soil carbon in the top 30 cm from 1999 to 2007. Conventional till and no cover cropping also showed slight increases.</p>", "photo1": "", "photo2": ""}

In [15]:
print df['values'][0:10]


0    {"observer": "", "label1": "cm of peat per yea...
1    {"observer": "", "label1": "cm of peat per yea...
2    {"observer": "", "label1": "No-till with cover...
3    {"observer": "Jeff Creque", "label1": "percent...
4    {"observer": "", "label1": "percent increase i...
5    {"observer": "", "label1": "percent increase i...
6    {"observer": "", "label1": "percent organic ma...
7    {"observer": "", "label1": "percent organic ma...
8    {"observer": "", "label1": "percent organic ma...
9    {"observer": "", "label1": "percent per year i...
Name: values, dtype: object

In [62]:
#df[df['obs_type'].isnull()]

In [47]:
df = df.replace(np.nan,' ', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE
#analysis_terms = ['analysis', 'density', 'average','variance','pH']
def mymunge(row): # variant without the hoop/Hoop checks in 'cover'; those rows fall through to later categories
    if any( ['along' in row['note'], 'transect' in row['type'], 'grade' in row['type'], 'line' in row['type'], 'map' in row['type'], 'plot' in row['type'], 'remonitor' in row['type'], 'plants' in row['type']] ):
        return 'transect'
    if any( ['vert' in row['type'], 'ang' in row['type'], 'step back' in row['note'], 'bare' in row['label1']] ):
        return 'cover'
    if any( ['infiltracion' in row['type'], 'infiltration' in row['type']] ):
        return 'infiltration'
    if any( ['analysis' in row['type'], 'composited' in row['type'], 'misc' in row['type'], 'density' in row['type'],'variance' in row['type'], 'average' in row['type'], 'pH' in row['type']] ):
        return 'analysis'
    if 'photo' in row['type']:
        return 'photo'
    if 'change' in row['type']:
        return 'change'
    if any( ['brix' in row['type'], 'biomass' in row['type'], 'clip' in row['type']] ):
        return 'food_analysis'
    
    
df['obs_type'] = df.apply(mymunge, axis=1)
#SEE HOW WE DID
print(len(df))
print(df.obs_type.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=False))


6312
analysis         2261
transect         1513
cover            1276
infiltration     1065
photo             170
food_analysis      15
NaN                12
Name: obs_type, dtype: int64

In [117]:
df = pd.read_csv('/Users/Peter/Documents/atlas/atlasbiowork.com/atlasbiowork/db/changeREADYG.csv')
#print df.to_json(orient='records')

In [108]:
df.to_csv('/Users/Peter/Documents/atlas/atlasbiowork.com/atlasbiowork/db/changeREADYH.csv', index=False)

In [131]:
def buildJSON(row): # builds a Django ORM .save() statement per row (not JSON, despite the name)
    v=''
    v += 'Observation(values=' + row['values']
    v += ', observer_id=' + str(row['observer_id'])
    v += ', site_id=' + str(row['site_id'])
    v += ', type_id=' + str(row['type_id'])
    v += ').save()'
    return v
    
df['json'] = df.apply(buildJSON, axis=1)
df['json'].to_csv('/Users/Peter/Documents/atlas/atlasbiowork.com/atlasbiowork/db/changeK.csv', index=False)
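
The json column previewed at the top of the notebook holds Django fixture-style records ({"model": "obs.observation", "fields": {...}}). A sketch of producing those with json.dumps rather than concatenation; the preview is truncated, so the exact field names inside "fields" are assumptions:

def buildFixture(row):
    return json.dumps({
        'model': 'obs.observation',
        'fields': {
            'values': row['values'],      # field names are guesses from the truncated preview
            'observer': row['observer_id'],
            'site': row['site_id'],
            'type': row['type_id'],
        },
    })

#df['json'] = df.apply(buildFixture, axis=1)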

search and replace in dataframe

First, collect the strings you'd like to standardize. Then supply the list of variant terms and the replacement term. The replacement has to target specific columns.


In [94]:
#GET STRINGS OF LABELS "list_of_terms"
a1 = df.label1.unique()
a2 = df.label2.unique()
a3 = df.label3.unique()
#a4 = df.label4.unique()
#a5 = df.label5.unique()
list_of_terms = np.concatenate([a1,a2,a3])
#sorted(list_of_terms)

In [4]:
#SEARCH AND REPLACE
#searchterm = 'lichen'
#replaceterm = 'squish'
#the_list = [item for item in list_of_terms if searchterm in str(item)]

#need to target specific columns

mycols=df[['label1', 'label2','label3','label4','label5']]
#mycols.replace('=','wox', regex=True)

#mycols.replace(to_replace=the_list, value=replaceterm, inplace=True)

In [12]:
#df = df.replace(np.nan,' ', regex=True)
df.label4[df.label4.str.contains('litter')]


Out[12]:
4      litter
5      litter
54     litter
55     litter
56     litter
140    litter
269    litter
Name: label4, dtype: object
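
The commented-out NaN replacement above matters: str.contains returns NaN for missing labels, and boolean indexing then fails. The na= argument is an alternative to replacing NaNs first:

df.label4[df.label4.str.contains('litter', na=False)]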

In [16]:
searchterm = 'lichen'
replaceterm = 'moss/algae/lichen'
the_list = [item for item in list_of_terms if searchterm in str(item)]

#need to target specific columns

mycols=df[['label1', 'label2','label3','label4','label5']]
mycols.describe()

#mycols.replace(to_replace=the_list, value=replaceterm, inplace=True)


Out[16]:
        label1  label2  label3  label4  label5
count     1308    1308    1308    1308    1308
unique       3      12      17      11       7
top
freq       745     940     864    1145    1228

In [6]:
searchterm = 'lichen'
replaceterm = 'moss/algae/lichen'
the_list = [item for item in list_of_terms if searchterm in str(item)]

#need to target specific columns

df[['label1', 'label2','label3','label4','label5']].replace(to_replace=the_list, value=replaceterm, inplace=True)
#NOTE: inplace replace on a column slice does not modify df; see the working pattern below
the_list


Out[6]:
['moss/algae/lichen',
 'moss/algae/lichen',
 '%lichen=',
 'moss/algae/lichen',
 'moss/algae/lichen']
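
Since an inplace replace on a column slice doesn't persist, the pattern that actually modifies df (a sketch, assuming the same column names) is to assign the replaced columns back:

label_cols = ['label1', 'label2', 'label3', 'label4', 'label5']
df[label_cols] = df[label_cols].replace(to_replace=the_list, value=replaceterm)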

In [29]:
transects.to_csv('/Users/Peter/Documents/scc/challenge/PSQL/transectsOct23.csv', index=False)

In [33]:
linephotos = df[(df.type.str.contains('line'))]
angphotos = df[(df.type.str.contains('ang')) | (df.note.str.contains('step back'))]
vertphotos = df[df.type.str.contains('vert')]
len(vertphotos)
#re.findall('\d+', s) #finds digits in s
def get_num(x):
    digits = ''.join(ele for ele in x if ele.isdigit())
    if digits:
        return int(digits)
    pass

#get_num('Hoop 1, 125\'')
#df.ix[459]
for y in range(len(df)):


Out[33]:
6312

In [63]:
#basic row selection from http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/ 
#which has GREAT info on joins
peter_plots = df[(df.lat > 0) & (df.observer.str.contains('Peter Donovan'))]
features = df[(df.type == '*plot summary')|(df.type == 'change')|(df.type.str.contains('remonitor'))];
#df.iloc[100] and df.ix[100] get the row referred to by the default 0-based index
# df.loc['Kellogg LTER'] doesn't work because it's not an index; 
# dfnew = df.set_index('id'); this works even tho id is not unique
#dfnew.loc['Kellogg LTER'] and this works; use inplace=True as arg to modify existing df
# dfnew.loc['BURR1'] returns all rows for this index

#column selector
#df[['type','label3']]; need to use double [] to enclose a list
#new column
#df['new'] = df.lat + 2

In [60]:
#df['featureID'] = df.id.str[0:6]  #str function slices string
#df.type  #although type is a keyword this works

In [58]:
#subsetting dataframes by finding rows

change = df[(df['type'] == "change")]
plots = df[df['type']=="*plot summary"]
peter = df[df.observer == "Peter Donovan"]
north = df[df.lat > 49] # gives 19 rows; df.lat > 49 alone returns a boolean Series over all rows
type(change)


Out[58]:
pandas.core.frame.DataFrame

In [89]:
food_analysis = df[(df['obs_type'] == "food_analysis")]
food_analysis.to_csv('/Users/Peter/Documents/scc/challenge/obs_types/food_analysis.csv', index=False)